Abstract

In high energy physics experiments, calorimetric data reconstruction requires a suitable clustering technique in order 
to obtain accurate information about the shower characteristics such as position of the shower and energy deposition. 
Fuzzy clustering techniques have high potential in this regard, as they assign data points to more than one cluster, 
thereby acting as a tool to distinguish between overlapping clusters. Fuzzy c-means (FCM) is one such clustering 
technique that can be applied to calorimetric data reconstruction. However, it has a drawback: it cannot easily identify 
and distinguish clusters that are not uniformly spread. A version of the FCM algorithm called dynamic fuzzy c-means
(dFCM) allows clusters to be generated and eliminated as required, with the ability to resolve non-uniformly
distributed clusters. Both the FCM and dFCM algorithms have been studied and successfully applied to simulated 
data of a sampling tungsten-silicon calorimeter. It is seen that the FCM technique works reasonably well, and at the 
same time, the use of the dFCM technique improves the performance. 
1. Introduction

The human mind can easily grasp concepts like imprecision, uncertainty, partial truth and approximation.
Conventional (hard) computing, i.e., computing with traditional electronic computations, has limitations, as it only sees
in terms of 'black' and 'white', or more accurately, zeroes and ones. Though this way of computing is sufficient
for a large number of tasks, it cannot handle a problem for which there is not yet enough information to calculate
the answer definitively. Soft computing, on the other hand, uses nature as a role model, allowing computers to
take uncertainty and imprecision into account, in much the same way the human mind does. Soft computing surfaced
as a formal area of study in computer science [1] in the early 1990s, and underlies the emerging field of 'computational
intelligence', essentially adding 'intelligence' to computing techniques [2]. It is an essential tool for a good decision
making model, and has been used in a wide variety of disciplines, from bioinformatics to aeronautical engineering
to image processing. Computational intelligence has also found its way to detector physics, especially for high
energy physics experiments, and has been used to determine detector performances [3, 4, 5], as well as to extract physical
information of interacting particles.

In high energy physics experiments, one of the essential steps in the extraction of physical information from 
particle detectors is to reconstruct the characteristics of the incoming particles. A calorimeter is normally used for 
accurate characterization of an incoming particle in terms of its position and energy. Depending on the structure, 
calorimeters can be categorized into two types: homogeneous and sampling. A homogeneous calorimeter is made up 
of a single block of material that acts as an absorber as well as an active medium from which the signal can be collected. 
On the other hand, sampling calorimeters are segmented in the longitudinal as well as in the transverse directions, 
consisting of layers of absorber and active detector combinations. Each segment or cell acts as an independent detector. 
To reconstruct the physical information one needs to identify the group of hit cells associated with an incoming 
particle, and determine the most probable position of the particle and its energy. This is done by the use of a clustering 
algorithm. Additionally, the clustering techniques can also be used to classify different sets of events [10].

A number of clustering algorithms exist in the literature, and can be employed for high energy physics data 
reconstruction. Algorithms like the contiguity method and cellular automata, which search neighborhoods of cells, would
not be as useful for distinguishing between overlapping clusters found in high density environments. Some other
methods, like deterministic annealing, require parameters that depend very strongly on the data pattern, and would 
not serve well as a generic clustering algorithm. Hard clustering techniques such as local maxima search, connected- 
cell search and k-means clustering simply assign a data point to a cluster. A data point either lies in a cluster or it 
does not, and so, overlapping clusters are hardly distinguishable. In fuzzy clustering, on the other hand, data points 
are represented by degrees of membership, i.e., they lie within clusters to varying degrees. The term 'fuzzy' is used 
because an observation may in fact lie in more than one cluster simultaneously, as is the case with many high energy 
physics applications. In this article, a type of fuzzy clustering called fuzzy c-means (FCM) is studied and
applied to simulated data of a sampling tungsten-silicon calorimeter. A modification to the FCM algorithm [14] that
allows the clustering to be carried out dynamically was found to resolve particle clusters well, specifically in the case
of non-uniformly distributed clusters.

This article is organized as follows. Section 2 discusses the FCM algorithm, including possible modifications.
One such modification is the dynamic FCM (dFCM) technique, which is discussed in Section 3. Details of the
calorimetric configuration are outlined in Section 4. The power of FCM to distinguish between overlapping clusters
is demonstrated in Section 5 by performing clustering on two-photon clusters obtained from neutral pions of varying
energies. The ability of dFCM to resolve non-uniformly distributed clusters with ease, unlike FCM, is demonstrated
in Section 6, followed by its application to calorimetric data. The paper concludes with a summary and a
remark on future perspectives.



2. Fuzzy Clustering and the FCM Algorithm 



Fuzzy clustering is a technique in which the allocation of data points to clusters follows fuzzy logic, thereby 
providing the means to separate overlapping clusters. Fuzzy clustering is appropriate for applications in detectors, as 
showers generated by particles may have profiles that are continuous or overlapping on the detector planes. Fuzzy 
c-means (FCM) is a commonly used fuzzy clustering algorithm [11, 12, 13]. It is a version of the k-means algorithm
that incorporates fuzzy logic, so that each point has a weak or strong association to the cluster, determined by the
inverse distance to the center of the cluster. The centers obtained using the FCM algorithm are based on the geometric
locations of the data points. The FCM algorithm has been used in [15] to track gamma rays in segmented detectors,
and in [3] to find clusters in the preshower detector in high energy heavy ion experiments.

For a set of data points X, FCM seeks to minimize the objective function $J_m$, the weighted within-groups sum of
squared errors:

$$J_m(U, V; X) = \sum_{k=1}^{n} \sum_{i=1}^{C} (u_{ik})^m \, \|x_k - v_i\|^2, \tag{1}$$

where $V = (v_1, v_2, \ldots, v_C)$ is a vector of unknown cluster centers, U consists of the memberships $u_{ik}$ of the
k-th point in the i-th cluster, and $\|\cdot\|$ is any inner product norm. The fuzzy factor m normalizes and fuzzifies
the memberships so that their sum is 1. A value of 2 implies linear normalization, whereas when m is closer to 1, the
cluster center closest to the point is given much more weight than the others, making the algorithm similar to k-means.
Optimization of $J_m$ is based on iteration through certain necessary conditions. According to the FCM theorem [12], if
$D_{ik} = \|x_k - v_i\| > 0$ for all i and k, then (U, V) may minimize $J_m$ only if, when m > 1,

$$u_{ik} = \left[ \sum_{j=1}^{C} \left( \frac{D_{ik}}{D_{jk}} \right)^{2/(m-1)} \right]^{-1}, \tag{2}$$

where $1 \le i \le C$, $1 \le k \le n$, and

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m \, x_k}{\sum_{k=1}^{n} (u_{ik})^m}. \tag{3}$$



Alternating optimization (AO) is the iteration technique that is most often used in this algorithm. It simply loops
through one cycle of estimates $V_{t-1} \to U_t \to V_t$ until some error criterion is reached. An error threshold $\epsilon$ can
be specified so that the error criterion is $\|V_{t-1} - V_t\|_{err} < \epsilon$. Defuzzification, that is, determining which point lies in
which cluster, can be done by looking at the memberships of a point associated with each cluster. A point belongs to a
cluster if its corresponding membership is the maximum out of the point's memberships in all of the clusters. Figure 1
shows the flow chart of the FCM algorithm. The algorithm runs for a specified number of clusters. This number can
be varied, and the best set of clusters can be selected by means of a validity index.
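
As an illustration of the alternating optimization loop described above, the following is a minimal sketch in plain Python, not the implementation used for the results in this paper. It uses the standard FCM membership and center updates (the conditions labelled Equations 2 and 3) with the Euclidean norm. The function names, the choice of the largest single-center shift as the error measure, and the `max_iter` safeguard are assumptions made for the example.

```python
import math

def memberships(points, centers, m):
    """Membership update: u[i][k] is the membership of point k in cluster i."""
    power = 2.0 / (m - 1.0)
    u = [[0.0] * len(points) for _ in centers]
    for k, x in enumerate(points):
        dists = [math.dist(x, v) for v in centers]
        if min(dists) == 0.0:
            # The point sits exactly on a center: give it full membership there.
            hit = dists.index(0.0)
            for i in range(len(centers)):
                u[i][k] = 1.0 if i == hit else 0.0
            continue
        for i, d_i in enumerate(dists):
            u[i][k] = 1.0 / sum((d_i / d_j) ** power for d_j in dists)
    return u

def update_centers(points, u, m):
    """Center update: mean of the points weighted by membership to the power m."""
    dim = len(points[0])
    centers = []
    for row in u:
        weights = [w ** m for w in row]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)))
    return centers

def fcm(points, centers, m=1.8, eps=0.01, max_iter=100):
    """Alternate the two updates until no center moves by more than eps."""
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        # Assumed error measure: largest displacement of any single center.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, memberships(points, centers, m)

def defuzzify(u):
    """Assign each point to the cluster in which its membership is largest."""
    n_points = len(u[0])
    return [max(range(len(u)), key=lambda i: u[i][k]) for k in range(n_points)]
```

For two well-separated groups of points, `fcm` followed by `defuzzify` returns one label per point, and the memberships of each point sum to 1 across the clusters.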



[Flow chart: specify the number of clusters and initialize the cluster centers and memberships; update the memberships
according to the cluster centers; update the cluster centers according to the memberships; repeat until the error
criterion is met; calculate the validity index; select the cluster centers with the best validity index.]

Figure 1: The flow chart for the fuzzy c-means (FCM) algorithm.



2.1. Validity Indices 

A validity index seeks to determine how well the data is represented by the selected clusters. There are a number of
validity indices available in the literature [13]. The Xie-Beni index [16] is one such validity index that is widely used
because of its dependence on both memberships and geometric distances. More explicitly, the index depends
on the distances between data points and the centers of clusters as well as the distance between cluster centers. Other
indices are based on other aspects of the data. For instance, the partition coefficient depends only on the memberships
of the data:

$$PC(c) = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2, \tag{4}$$

where n is the number of data points, c is the number of clusters, and $u_{ij}$ is the membership of the j-th data point in the
i-th cluster. It varies from 1/c to 1. This index is to be maximized. In order to remove the dependence of PC on c, the
Modified Partition Coefficient was defined as [17]:

$$MPC(c) = 1 - \frac{c}{c - 1} \left( 1 - PC(c) \right), \tag{5}$$

which varies from 0 to 1. This index is to be maximized as well. The Xie-Beni index is chosen for the present study,
though it may be noted that any validity index preferred by the experimenter may be employed. The Xie-Beni index
$v_{XB}$ is defined as follows:

$$v_{XB}(U, V; X) = \frac{\sum_{i=1}^{C} \sum_{k=1}^{n} u_{ik}^2 \, \|x_k - v_i\|^2}{n \left( \min_{i \neq j} \{ \|v_i - v_j\|^2 \} \right)}. \tag{6}$$

It is essentially the ratio of the total variation of the cluster centers and memberships of the observations in the 
groupings to the separation between the cluster centers, and minimization of the index leads to better clusters. The 
larger the separation between clusters and the more closely packed the points in the cluster, the better the clustering. 
This index has no upper bound. 
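
The three indices discussed in this section can be written compactly in code. The sketch below assumes the standard definitions of the partition coefficient, the modified partition coefficient and the Xie-Beni index (squared memberships and squared Euclidean distances); the function names and the layout of `u` as a list of per-cluster rows are choices made for this example only.

```python
import math

def partition_coefficient(u):
    """PC: mean of the squared memberships; ranges from 1/c to 1."""
    n = len(u[0])
    return sum(w * w for row in u for w in row) / n

def modified_partition_coefficient(u):
    """MPC: PC rescaled so that it ranges from 0 to 1 for any c >= 2."""
    c = len(u)
    return 1.0 - c / (c - 1.0) * (1.0 - partition_coefficient(u))

def xie_beni(points, centers, u):
    """Xie-Beni index: compactness divided by n times the minimum separation."""
    n = len(points)
    c = len(centers)
    compactness = sum(
        u[i][k] ** 2 * math.dist(points[k], centers[i]) ** 2
        for i in range(c) for k in range(n))
    separation = min(
        math.dist(centers[i], centers[j]) ** 2
        for i in range(c) for j in range(i + 1, c))
    return compactness / (n * separation)
```

PC and MPC are maximized, whereas the Xie-Beni index is minimized, so a selection loop over candidate cluster numbers has to compare them in opposite directions.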

2.2. Modifications to the FCM Algorithm

Over the years, a number of application-specific modifications have been made to the FCM algorithm. As FCM
clustering is based on the Euclidean distance between data samples, it gives each data point and each dimension (or
feature) the same importance. A modification to FCM using feature-weight learning was looked into [18, 19]; however,
in the present study, each of the three dimensions considered in the clustering should be given equal importance. The
drawbacks of FCM include the dependence on the fuzzy factor m, which may vary from one data set to another [20],
and the fact that it treats outliers in the same way it treats data points lying in the bulk of the data. In order to address
these concerns, the suppressed FCM algorithm [21] was looked into. This algorithm prizes the biggest memberships
and suppresses the others with a weighting factor.

A modified version of the FCM algorithm [14] has been developed, which has the main advantage that it dynamically
finds clusters as data streams in, deleting and generating clusters as needed. The decision making for valid
clusters is made through the use of a validity index. Theoretically, the method does not require a maximum number 
of clusters, only a minimum, which gives it an edge over the energy-weighted modification. It also adapts to the data 
pattern at each instant. This modified algorithm is called dynamic Fuzzy c-Means (dFCM), which is discussed in 
detail in the next section. 



3. The dynamic Fuzzy c-Means (dFCM) Algorithm 

The dFCM algorithm [14] is a modification of the fuzzy c-means algorithm, allowing cluster centers to be adaptively
updated as data points keep streaming in. If a new cluster is formed, then a new cluster center is automatically
generated. Figure 2 gives a detailed flow chart of the algorithm. The working of the algorithm is as follows:

1. To start with, we assume that we already have a few of the incoming data points at hand. From these data
points, we can roughly estimate, but not restrict ourselves to, the range of the incoming data. Initially, a few
parameters are specified, namely, the membership threshold $\beta$, the FCM error criterion $\epsilon$, and the bounds of
the variable C, the number of clusters. The initial number of points is taken to be low, just to kick-start the
clustering and give the algorithm a brief idea about the values it is dealing with. Unless the initial number of
points is so large that hints of clusters can already be seen, the clustering will not be affected.

2. If the minimum value of the variable C is $C_{min}$, then, initially, $C_{min}$ cluster centers are generated uniformly within
the input space. Note that this number must be greater than or equal to 2, as the unit cluster is not allowed in the
FCM algorithm. Once the initial cluster centers have been specified, the memberships of the initial data points
are found using Equation 2.

3. The data points are now allowed to stream in, one by one. When a new data point arrives, its memberships in
the clusters present are calculated. If the maximum membership value associated with this point is greater than
or equal to the membership threshold $\beta$, then a simple AO update takes place, as outlined in Section 2. This
means that the data point belongs to at least one of the clusters to an extent that matches or exceeds $\beta$.

4. Let C be the number of clusters present at a given time. If the maximum membership of the data point falls
below $\beta$, then the validity of the present number of cluster centers C is compared with each of the validities of
L = C - 2 to C + 2 clusters in the following manner:

• The stored values of the cluster centers are checked to see if L clusters have been generated and updated
at a previous time. If so, then the old values are updated using the FCM algorithm, and the validity index
is evaluated.

• If L cluster centers have not been generated before, then the stored values of L - 1 centers are used, and an
L-th center is generated by slightly perturbing the data point. The AO update is carried out, and the validity
index is evaluated.

• The cluster centers that generate the best value of the validity index are taken.
5. The process continues until the data points stop streaming in.

Figure 2: The flow chart of the dynamic FCM (dFCM) algorithm.
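
The decision taken for each incoming point in steps 3 and 4 can be sketched as below. This is only an illustration of the branching logic under stated assumptions, not the authors' implementation: the names `point_memberships`, `candidate_cluster_counts` and `next_action` are hypothetical, the membership threshold is called `beta` here, the memberships use the standard FCM formula with the Euclidean norm, and the AO update and validity evaluation themselves are left out.

```python
import math

def point_memberships(x, centers, m=1.8):
    """Memberships of a single new point in each of the existing clusters."""
    power = 2.0 / (m - 1.0)
    dists = [math.dist(x, v) for v in centers]
    if 0.0 in dists:
        # The point coincides with a center: full membership in that cluster.
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def candidate_cluster_counts(c, c_min=2, c_max=None):
    """Cluster numbers C-2 .. C+2, clipped to the allowed bounds (minimum 2)."""
    lo = max(c_min, c - 2)
    hi = c + 2 if c_max is None else min(c_max, c + 2)
    return list(range(lo, hi + 1))

def next_action(x, centers, beta=0.8, m=1.8):
    """Decide what to do with a newly arrived point.

    Returns ("ao_update", []) when the point lies in some cluster with
    membership >= beta, otherwise ("check_validity", candidate counts).
    """
    u = point_memberships(x, centers, m)
    if max(u) >= beta:
        return "ao_update", []
    return "check_validity", candidate_cluster_counts(len(centers))
```

A point close to an existing center triggers only the AO update, whereas a point roughly equidistant from all centers has low memberships everywhere and triggers the validity comparison over the candidate cluster numbers.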

3.1. The Membership Threshold ($\beta$):

The purpose of the membership threshold, $\beta$, is to avoid evaluating cluster validity each time a data point comes
in. That is, if the data point lies to a specified satisfactory extent in a cluster, then it is not necessary to check if other
clusters are better. However, the role of $\beta$ may be allowed to change, depending on what the specific application of the
dFCM algorithm may require. The experimenter is free to set the conditions that need to be met in order to evaluate
cluster validity. For instance, the condition that is specified in this work is whether or not the data point lies within a
cluster to a satisfactory extent. In another application, it may be that validity index evaluation need only take place if
the new updated centers are significantly different from the old ones. The factor $\beta$ can then be used in a condition that
may look like:

$$\|V_{old} - V_{new}\| > \beta, \tag{7}$$

i.e., if the distance between the two sets of coordinates is greater than $\beta$, then validity is evaluated. In a situation like
this, the AO update would take place automatically once a data point streams in, without any check on its membership.
Once the centers are updated, $\beta$ can be used to check whether they are significantly different from the previous ones,
and therefore whether or not validity should be evaluated. This prevents redundant calculations.
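
The alternative condition of Equation 7 could be checked as in the short sketch below. The text does not specify which norm is taken between the two sets of centers, so the example assumes the root of the summed squared coordinate differences, with old and new centers matched by index; the function names are hypothetical.

```python
import math

def centers_shift(old_centers, new_centers):
    """Distance between two equally sized sets of centers (assumed norm)."""
    return math.sqrt(sum(
        (a - b) ** 2
        for v_old, v_new in zip(old_centers, new_centers)
        for a, b in zip(v_old, v_new)))

def needs_validity_check(old_centers, new_centers, threshold):
    """True when the centers moved by more than the chosen threshold."""
    return centers_shift(old_centers, new_centers) > threshold
```

With this variant the threshold is a distance in the units of the input space rather than a membership value between 0 and 1.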
3.2. Cluster Centers and Validity

Cluster centers are kept track of because validity indices may change as data points keep streaming in. At a certain 
point in time, two cluster centers may be sufficient, whereas at a later point, the algorithm may call for four cluster 
centers. The reverse may be true as well, and so, in order to keep the window open for cluster number possibilities, 
C - 2 to C + 2 clusters are checked, where C is the actual number of cluster centers at a particular point in time. This 
window is flexible, and may be modified to be wider or narrower, depending on the preference of the experimenter. 
Note that the values of the cluster centers are made to fall within the initially specified range. However, the maximum 
number of cluster centers need not be specified if the experimenter prefers to keep the upper limit open. Any validity 
index favored by the experimenter may be employed, as the algorithm does not specifically depend on the type of 
validity index used. 

3.3. Running Time of dFCM 

The running time for the dFCM algorithm coupled with the Xie-Beni index was calculated and found to be $O(an^2)$,
where n is the size of the data, and a is the maximum number of AO updates required, explained below. The worst-case
scenario was taken, in which a new cluster is generated for every incoming point, and there are no clusters to
begin with. The running time was obtained as follows. Simple arithmetic and logical operations take constant
amounts of time. Unless a calculation that uses these operations explicitly depends on the size of the data or on a
search, the calculation also takes a constant amount of time. As each data point streams in, a check is performed
to see if it lies within the established clusters to a satisfactory extent (step 3). Since we are assuming that each new
point acts as a new cluster, each possible 'check' with the present clusters (i.e., in this case, the data points themselves)
takes place. When the first point streams in, no check is performed, as there are no data points present. When the
second point streams in, only 1 check is performed. When the next point streams in, 2 checks are performed, and so
on. The total number of checks is represented by the following series:

$$1 + 2 + 3 + \cdots + (n - 1) = \frac{n(n - 1)}{2}. \tag{8}$$

This implies a running time of order $O(n^2)$ for the various checks. When a new cluster is generated, validity indices
are calculated for L = C - 2 to C + 2, and an AO update is performed (step 4).

The running time of the update depends on the number of data points already present, say k, and the number of
iterations required before the error criterion is reached, say $a_{k+1}$, as it may differ for each case. The subscript k + 1
indicates that the (k + 1)-th data point is streaming in. Therefore, the running time of the update can be considered to be
$O(k a_{k+1})$. The running time of the validity index calculations can be taken as a constant ($l = C + 2 - (C - 2) = 4$) times
k, as the validity index evaluation sums over all the data points present at the given time only once, and is performed
l times. The total running time associated with each term contributing to Equation 8 can be thought
of as $O(k a_{k+1}) + O(lk)$, of which $O(k a_{k+1})$ dominates. The order of the total running time of the AO updates can
therefore be thought of as:

$$1 \cdot a_2 + 2 \cdot a_3 + 3 \cdot a_4 + \cdots + (n - 1) \cdot a_n. \tag{9}$$

Defining $a = \max(a_2, \ldots, a_n)$, this sum is at most $a \cdot n(n - 1)/2$. Therefore, the total running time of the dFCM
algorithm can be considered to be $O(an^2)$.
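
The counting argument above can be checked numerically with a few lines of Python. This is a sketch of the bookkeeping only; the function names are hypothetical, and the list `iterations` stands for assumed values of the per-point AO iteration counts $a_2, \ldots, a_n$.

```python
def worst_case_checks(n):
    """Number of membership checks, 1 + 2 + ... + (n - 1), for n streamed points."""
    return sum(range(1, n))

def ao_update_cost(iterations):
    """Sum of k * a_(k+1) for k = 1 .. n-1, where iterations = [a_2, ..., a_n]."""
    return sum(k * a for k, a in enumerate(iterations, start=1))

def ao_update_bound(iterations):
    """Upper bound a * n(n-1)/2, with a the largest iteration count."""
    n = len(iterations) + 1
    return max(iterations) * n * (n - 1) // 2
```

For any list of iteration counts, `ao_update_cost` never exceeds `ao_update_bound`, which is the step that turns the series of Equation 9 into the quadratic bound.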

3.4. Applications of dFCM 

The dynamic fuzzy c-means clustering technique can be applied in situations that involve online analysis of streaming
data, in which adaptive information is required, or in which the data to be clustered is not uniform. Most dynamic
versions of available clustering techniques are application specific. However, dFCM is a generic algorithm
that can be applied to a number of different situations. For instance, [14] discusses a potential application as an adaptive
rule extraction technique for fuzzy associative memories, in the field of soft computing. In this paper, dFCM has
been applied to high energy particle physics data reconstruction. Another potential application may lie in time-series
analysis and prediction, or updating databases.






4. Calorimeter Concept and Design 



As a demonstration of the applicability of the FCM and dFCM clustering techniques, a sampling calorimeter
consisting of tungsten and silicon layers was simulated using the GEANT4 package [24, 25]. A similar tungsten-silicon
calorimeter has been developed and used by the CALICE Collaboration [26, 27]. A sampling calorimeter
consists of several planes of absorber material along with active medium planes. A design of a tungsten-silicon
calorimeter was made with 20 layers, where each layer consisted of a 3 mm thick tungsten plate followed by a 0.3 mm
thick silicon sensor. Being a high-Z element, tungsten converts high energy photons or electrons into electromagnetic
showers. The majority of the photons that are emitted in high energy collisions are decay photons from $\pi^0$. One of
the major goals is to reconstruct $\pi^0$ and their energy from the measured photon showers. The opening angle between the two
emitted photons decreases as the energy of the $\pi^0$ increases. The distance between the two emitted photons has
been calculated for a placement of the calorimeter at a distance of 350 cm from the interaction vertex, as shown in
Figure 3. As the $\pi^0$ energy increases, the distance between the two photons decreases. The reconstruction of the
photon showers needs to be accurate in order to obtain the shower positions of the photons and deposited energy. In
order to measure the $\pi^0$ energy accurately, tracking of the shower in different layers is needed. Therefore, the position
resolution of the detectors in the sensitive medium has to be suitable.
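
The energy dependence of the photon separation follows from two-body decay kinematics. The sketch below uses the standard relation for the minimum opening angle of a symmetric $\pi^0 \to \gamma\gamma$ decay, $\sin(\theta_{min}/2) = m_{\pi^0}/E$, and converts it to a separation on a plane at 350 cm. The assumptions are that the plane is perpendicular to the $\pi^0$ direction and that the photons are emitted symmetrically about it; the function names are hypothetical.

```python
import math

PI0_MASS_GEV = 0.1349768  # neutral pion mass in GeV

def min_opening_angle(energy_gev):
    """Minimum opening angle (radians) between the two decay photons."""
    if energy_gev <= PI0_MASS_GEV:
        raise ValueError("the pion energy must exceed the pion mass")
    return 2.0 * math.asin(PI0_MASS_GEV / energy_gev)

def min_separation_cm(energy_gev, distance_cm=350.0):
    """Photon separation on a plane at distance_cm for the symmetric decay."""
    half_angle = min_opening_angle(energy_gev) / 2.0
    return 2.0 * distance_cm * math.tan(half_angle)

if __name__ == "__main__":
    for energy in (10.0, 50.0, 100.0, 200.0):
        print(energy, min_opening_angle(energy), min_separation_cm(energy))
```

Under these assumptions the separation falls roughly as 1/E, from about 9 cm at 10 GeV to below 1 cm at 100 GeV, which is comparable to the 1 cm pad size and motivates the finer 0.1 cm pads.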




[Figure 3 plot; horizontal axis: $\pi^0$ energy (GeV)]

Figure 3: Opening angle (top panel) between two photons emitted from the decay of $\pi^0$ as a function of $\pi^0$ energy. The minimum distance between
two photons at a distance of 350 cm from the vertex is shown in the bottom panel. As the $\pi^0$ energy increases, both the opening angle of the two
decay photons and the distance between them decrease. The dashed curves are fits to the data points, which agree with theoretically calculated
values.



The calorimeter, shown in Figure 4, was designed keeping all of the requirements in mind. The longitudinal
shower profile of the calorimeter, as shown in Figure 5, was studied in order to decide the granularity of the detector
layers. From this figure, it was found that the maximum of the shower in terms of energy deposition occurs around
layer 4 for a photon energy of around 5 GeV, whereas for a photon of 50 GeV, the shower maximum is around layer
8. Therefore, highly granular planes were placed in the region of the shower maxima for accurate measurements. For
the purpose of simulation, only twenty layers were considered, out of which three layers (4, 8 and 12) were made
of highly granular silicon pads, each with dimensions of 0.1 cm × 0.1 cm. The rest of the layers were made up of
1 cm × 1 cm silicon pad detectors. The three highly granular planes help to determine the shower position with high
accuracy and thus help in tracking the path of the incoming particles. The shower generation, energy deposition and
the characteristics of the shower produced by photons have been studied for the calorimeter using both FCM and
dFCM techniques.
5. FCM applied to photon clusters

In order to demonstrate the practicability of the FCM algorithm described in Section 2, data for a single $\pi^0$ decay
into two photon clusters at different energies was generated. Figure 6 displays the shower profiles of photons emitted
from $\pi^0$ at three different energies: 10 GeV, 50 GeV, and 100 GeV. The left column shows the event display of the
three cases. The two photon showers move closer to one another with the increase of $\pi^0$ energy, thereby causing an
overlap of clusters. The FCM algorithm was used to cluster the data points on each plane of the detector, taking the
x-position, y-position and the energy deposition of each pad into account. The fuzzy factor m was varied, and a value
of 1.8 was found to be suitable in resolving the individual clusters. This value has also been recommended in
the literature. For all FCM clustering performed, the following parameter values were taken:

$m = 1.8$ and $\epsilon = 0.01$.

The results of the clustering for layer 8 in terms of lateral shower profiles are shown in the right column of Figure 6.
The solid dots in the left column indicate the cluster centers found on each of the three planes. The lines represent the
photon tracks. It is seen that despite the large extent of overlap, the photon clusters are clearly identified by the FCM
algorithm, and the photon paths are successfully tracked using the three layers. The cluster positions, the tracks and the
energy depositions of the photons have been used to reconstruct the mass of the $\pi^0$. This is a test of the working of the
reconstruction algorithm, where clustering of the hit pads plays an important role. Figure 7(a-c) shows the invariant
mass reconstruction using the FCM algorithm, which essentially indicates the quality of $\pi^0$ reconstruction for three
different energies. Figure 7(d) shows the $\pi^0$ mass, reconstructed from the decay photons, for $\pi^0$ of different energies.
The statistical errors indicated on this figure are the rms values of the invariant mass distributions. The granularity of
the detector and the limitations of the clustering routine limit the reconstruction of the $\pi^0$. As seen from Figure 7(d), up to
100 GeV energy, there is a reasonable mass reconstruction, beyond which it deviates.
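
For reference, the two-photon invariant mass can be computed from the cluster positions and photon energies as in the sketch below. It uses the standard relation $m_{\gamma\gamma} = \sqrt{2 E_1 E_2 (1 - \cos\theta)}$ for massless photons. The assumptions are that both photons originate at the interaction vertex, taken as the origin, and that the photon energies have already been obtained from the deposited energies; that calibration step is not shown, and the function names are hypothetical.

```python
import math

def unit_vector(x, y, z):
    """Direction from the vertex (origin) to a cluster position."""
    r = math.sqrt(x * x + y * y + z * z)
    return (x / r, y / r, z / r)

def diphoton_mass(e1, pos1, e2, pos2):
    """Invariant mass of two photons with energies e1, e2 (GeV).

    pos1 and pos2 are (x, y, z) cluster positions in any common length unit.
    """
    n1 = unit_vector(*pos1)
    n2 = unit_vector(*pos2)
    cos_theta = sum(a * b for a, b in zip(n1, n2))
    # max() guards against a tiny negative value from rounding.
    return math.sqrt(max(0.0, 2.0 * e1 * e2 * (1.0 - cos_theta)))
```

For a symmetric decay, in which each photon carries half of the $\pi^0$ energy and the clusters sit at equal distances on either side of the $\pi^0$ direction, this expression returns the $\pi^0$ mass, which makes it a convenient closure test of the clustering.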






[Figure 5 plot; horizontal axis: layer number]

Figure 5: The longitudinal shower profile for energy deposition by photons of different energies in various layers of the sampling calorimeter.



6. Application of dFCM to Calorimetric Data 

The dFCM clustering technique, as discussed in Section 3, works equally well for the calorimetric data discussed
in the previous section. Since FCM is a static algorithm that takes all data points into account at one time, it is 
sometimes unable to identify clusters with non-uniform data patterns, like those encountered in a real experimental 
scenario where a variety of particles hit the calorimeter. The dynamic version of the FCM algorithm works better to 
resolve very close clusters. 

In order to determine the spatial resolving power of both FCM and dFCM, a sample of eight clusters on the fourth
layer of the calorimeter was generated. Figure 8 shows the profiles of the eight clusters. Each data point represents
the position and energy deposition on each of the small pads (0.1 cm × 0.1 cm). First, the FCM clustering technique
was performed, with an input cluster range of 2 to 10. The Xie-Beni index was used to select the best set of cluster
centers. The minimum value of the index was obtained for two cluster centers, as the data appears to be distributed
in two major groups of four clusters each. The solid points in the left panel of Figure 8 show the two cluster centers
obtained using the FCM algorithm and the Xie-Beni index on the raw data. The FCM clustering could not resolve the
individual clusters, as they appear to be within a group.

Next, the dFCM algorithm was applied to the same data, and the Xie-Beni index was used to find the number of
clusters. For all dFCM clustering performed, the following parameter values were taken:

$m = 1.8$, $\epsilon = 0.01$, $\beta = 0.8$, Number of initial points = 10.

The right panel of Figure 8 shows the result of clustering using the dFCM algorithm. As can be seen, there are some
isolated scattered data points that affected the number of clusters obtained. The energy depositions of the scattered
data points were low, below that of one MIP (minimum ionizing particle). Thus, a cut of one MIP at the cell level
was applied so that scattered points were removed before clustering. With this cut, we observe that the dFCM
method was able to resolve and identify all eight individual clusters. This was possible because the dFCM method,
as was emphasized earlier, allows clustering to be performed at various stages of the data as the data points stream in,
and adapts to the structure of the data at each instant. The dFCM algorithm is more suitable for resolving clusters that are
not uniformly distributed.
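
The cell-level cut described above amounts to a simple filter applied before clustering. The sketch below assumes that each hit is stored as a tuple (x, y, energy in keV) and uses the value of 89 keV that is quoted later in this section for one MIP; the function name and the data layout are hypothetical.

```python
MIP_KEV = 89.0  # energy deposition of one MIP, as quoted later in this section

def apply_mip_cut(hits, threshold_kev=MIP_KEV):
    """Keep only the hits whose energy deposition is at least one MIP."""
    return [hit for hit in hits if hit[2] >= threshold_kev]
```

Only the hits that survive this filter are passed on to the clustering routine.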

In order to demonstrate the power of the dFCM algorithm, around 1000 events of eight clusters were synthesized
with varying strengths as well as positions. For each event, the dFCM algorithm was applied. In Figure 9, the output
of the dFCM clustering in terms of the number of identified clusters has been plotted for the case of eight known clusters.
The histogram is normalized to 100. A peak is observed around 8 clusters, and within ±1, a total of 80 clusters can
be accounted for. That is, for 8 known clusters, the number of clusters found lies between 7 and 9 in about 80% of the
events. This indicates the quality of the clustering.

Figure 6: Profiles of neutral pions of energies 10 GeV (a-b), 50 GeV (c-d), and 100 GeV (e-f) decaying to two photons. Left: Longitudinal shower
profiles and the cluster centers found by the FCM algorithm on the 4th, 8th and 12th layers. Right: the lateral shower profiles of the 8th layer and
the clusters found by the FCM algorithm.

Figure 7: Invariant mass distributions of two photons in the sampling calorimeter using the FCM clustering routine for $\pi^0$ energies of: (a) 30 GeV,
(b) 50 GeV, (c) 100 GeV, and (d) reconstructed $\pi^0$ mass from the measurement of cluster positions using FCM. Statistical errors are indicated in the
figure.

Figure 8: (a) Raw data points from the fourth layer of the calorimeter and the clusters found by using the simple FCM algorithm. The minimum of
the Xie-Beni validity index was obtained for the two cluster centers shown. (b) The results of clustering obtained by using the dFCM algorithm. All
the clusters are properly accounted for.

Figure 9: Number of clusters identified by the dFCM algorithm for known sets of eight clusters. The histogram is normalized to 100 clusters. A
peak is observed around 8 clusters, and within ±1, a total of 80 clusters can be accounted for.

A note on the validity index: In order for a validity index to accurately select a set of cluster centers that represent 
a set of clusters well, a clustering algorithm must be capable of finding the clusters in the first place. That is to say, 
the clustering algorithm must find cluster center coordinates in such a way that minimizes the validity index that will 
ultimately be used to select the best set of coordinates. Even though the FCM algorithm performed clustering from 2 
to 10 cluster centers, the coordinates that it found when it searched for 8 cluster centers did not eventually yield the 
minimal value of the Xie-Beni index. Instead, the index selected 2 cluster centers as the better representation of the data. 
However, when the same index was used to select the best set of clusters found by dFCM, the minimal value was 
obtained when all 8 individual clusters were resolved. 
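
The selection step discussed in this note can be sketched as follows. This is only a minimal illustration, assuming the 
commonly quoted form of the Xie-Beni index (membership-weighted squared distances divided by n times the smallest squared 
separation between two centers) and a fuzzifier m = 2. The function names, argument names and array layout are hypothetical 
and are not taken from the analysis code used in this study. 

```python
import numpy as np


def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni index in its commonly quoted form (lower is better).

    points:      (n, d) data points
    centers:     (c, d) cluster centers, c >= 2
    memberships: (c, n) fuzzy memberships u_ik
    m:           fuzzifier (assumed value, not taken from the paper)
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    memberships = np.asarray(memberships, dtype=float)
    # Squared distance from every center to every point, shape (c, n).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    compactness = (memberships ** m * d2).sum()
    # Smallest squared distance between two distinct centers.
    c2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(c2, np.inf)
    separation = c2.min()
    return float(compactness / (len(points) * separation))


def select_by_index(candidates, points, m=2.0):
    """Return the (centers, memberships) pair with the minimal index."""
    return min(candidates, key=lambda cm: xie_beni(points, cm[0], cm[1], m))
```

The sketch makes the point of the note explicit: the index can only rank the candidate sets of centers that the clustering 
algorithm hands to it, so a poor candidate for 8 centers cannot be selected over a compact candidate for 2 centers. 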

While it is true that validity indices are by no means generic, it is safe to say that dFCM can at least find suitable 
cluster centers when searching over a range of them, whether or not the validity index can identify them. This is 
also true for FCM, which can be forced to find a certain set of clusters by selecting a single number of clusters to 
find. Keeping the role of the validity index in mind, the same experiment was repeated with the partition coefficient (PC) 
described by Equation [4] and its modification (MPC), described by Equation [5]. With FCM, the two coefficients were 
maximized when 2 cluster centers were obtained, whereas with dFCM the PC selected 7 cluster centers and the MPC 9 cluster 
centers. Even when the range of cluster centers was modified to [3,10], FCM did not provide cluster centers that 
resolved the clusters well, and the validity indices indicated that having only 3 cluster centers was the better option. 
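
For reference, the two coefficients can be computed from a membership matrix as in the sketch below. It assumes the 
standard definitions, PC = (1/n) Σ u_ik² and MPC = 1 - c/(c-1) (1 - PC), which may differ in notation from the forms 
given in Equations [4] and [5]. The function names and the (c, n) array layout are hypothetical. 

```python
import numpy as np


def partition_coefficient(memberships):
    """PC = (1/n) * sum over clusters and points of u_ik^2.

    memberships: (c, n) fuzzy membership matrix whose columns sum to 1.
    PC ranges from 1/c (completely fuzzy) to 1 (hard partition).
    """
    memberships = np.asarray(memberships, dtype=float)
    n = memberships.shape[1]
    return float((memberships ** 2).sum() / n)


def modified_partition_coefficient(memberships):
    """MPC = 1 - c/(c-1) * (1 - PC), which rescales PC to [0, 1].

    Requires at least two clusters (c >= 2).
    """
    memberships = np.asarray(memberships, dtype=float)
    c = memberships.shape[0]
    pc = partition_coefficient(memberships)
    return 1.0 - c / (c - 1.0) * (1.0 - pc)
```

Both quantities are maximized, in contrast to the Xie-Beni index, which is minimized. The rescaling in the MPC removes the 
dependence of the lower bound of the PC on the number of clusters, which is one reason the two can prefer different 
numbers of cluster centers. 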

Finally, as an application of the clustering algorithm to calorimetric data, a simulation of two 50 GeV neutral 
pions, placed close together, was studied for layers 4, 8 and 12, where the higher resolution silicon pads are placed. 
The data points in Figure [10] show the hits on the 4th, 8th and 12th layers of the calorimeter. The four clusters can 
be seen to be reasonably close together. The dFCM clustering algorithm was carried out to identify the clusters. A 
cut off of 89 keV, corresponding to one MIP, was applied before the clustering. The left side panels, (a), (c) and (e), 
give the results of the clustering routine with only this threshold of one MIP applied before clustering. It is seen that in 
all cases, the major clusters have been accounted for. Some scattered data points that have higher energy depositions 
give additional clusters, which are essentially outliers. It is found that the energy depositions and number of hit cells 
of the outlier clusters are much smaller than those of the four main clusters for each layer. Using these data sets, the 
behaviour of the outliers was studied. It turns out that most of the outlier clusters are made up of a few cells 
with low values of deposited energy. Thresholds on the number of cells per cluster and deposited energy can be 
used to eliminate the outlier clusters. The panels (b), (d), and (f) on the right side of the figure show the results of 
clustering after the application of thresholds on the cell hits, which eliminate the outliers in each of the layers. In the case 
of actual calorimetric data, the thresholds can be optimized and applied in order to obtain the photon clusters. 
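
The two-stage selection described above (a one-MIP cut on cells before clustering, followed by thresholds on clusters) 
can be sketched as below. Only the 89 keV value comes from the text. The cluster-level thresholds `min_cells` and 
`min_energy_kev` are hypothetical parameters to be tuned, the use of a hard cluster label per cell (for example the 
cluster with the largest membership) is an assumption, and whether the cut is inclusive is also an assumption. 

```python
import numpy as np

MIP_THRESHOLD_KEV = 89.0  # one-MIP cut quoted in the text


def apply_mip_cut(energies_kev, threshold_kev=MIP_THRESHOLD_KEV):
    """Boolean mask of cells that pass the pre-clustering energy cut.

    The cut is taken as inclusive (>=); this is an assumption.
    """
    return np.asarray(energies_kev, dtype=float) >= threshold_kev


def drop_outlier_clusters(labels, energies_kev, min_cells, min_energy_kev):
    """Return the ids of clusters that survive both thresholds.

    labels:         (n,) hard cluster id for each cell (assumed layout)
    energies_kev:   (n,) deposited energy of each cell
    min_cells:      hypothetical threshold on cells per cluster
    min_energy_kev: hypothetical threshold on summed cluster energy
    """
    labels = np.asarray(labels)
    energies_kev = np.asarray(energies_kev, dtype=float)
    kept = []
    for cid in np.unique(labels):
        in_cluster = labels == cid
        enough_cells = in_cluster.sum() >= min_cells
        enough_energy = energies_kev[in_cluster].sum() >= min_energy_kev
        if enough_cells and enough_energy:
            kept.append(int(cid))
    return kept
```

For example, with `labels = [0, 0, 0, 1, 2, 2]` and energies `[500, 400, 300, 100, 95, 90]` keV, requiring at least 2 
cells and 1000 keV per cluster keeps only cluster 0. 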

7. Summary 

In order to reconstruct the physical information extracted from calorimeters and other particle detectors, a suitable 
clustering technique must be applied. The ability of fuzzy clustering to assign data points to more than one cluster at 
a time makes it a powerful tool when clusters overlap, as is often the case at the large particle densities of high energy 
experiments. The FCM technique and its dynamic version, dFCM, have been evaluated for use in calorimetric data 
reconstruction. 

A tungsten-silicon sampling calorimeter of 20 radiation lengths was configured using the GEANT4 package. The 
calorimeter consisted of 20 layers, each with one radiation length of tungsten and 300 micron thick silicon pad detectors. Layers 
4, 8 and 12 consisted of high resolution silicon pads, with 0.1 cm x 0.1 cm sized pixels. The remaining layers were 
silicon pad layers with cell dimensions of 1 cm x 1 cm. The fuzzy c-means (FCM) clustering technique was applied 
to all the layers for the two photon clusters obtained after the decay of a neutral pion. The study demonstrated how 
the clusters and photon tracks could be identified for different energies of neutral pions. The tracks are successfully 
identified by locating the clusters in the 3 high resolution pad layers, and the total energy of a given track is obtained by 
summing over clusters of all pad layers belonging to the track. This provides a method for successful π⁰ reconstruction 
in the calorimeter. 
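
As an illustration of this reconstruction step, the sketch below combines two photon tracks into an invariant mass using 
the standard two-photon relation m = sqrt(2 E1 E2 (1 - cos θ)), where θ is the opening angle between the photon directions. 
It assumes that the track energies have already been calibrated from deposited energy to photon energy and that both 
photons originate from a common vertex. The function names and the default vertex at the origin are hypothetical. 

```python
import numpy as np


def track_energy(cluster_energies_by_layer):
    """Total energy of a track: sum of its cluster energies over layers."""
    return float(np.sum(cluster_energies_by_layer))


def invariant_mass(e1, e2, pos1, pos2, vertex=(0.0, 0.0, 0.0)):
    """Two-photon invariant mass, in the same units as the energies.

    e1, e2:     photon energies (assumed already calibrated)
    pos1, pos2: cluster positions used to define the photon directions
    vertex:     assumed common origin of both photons
    """
    vertex = np.asarray(vertex, dtype=float)
    d1 = np.asarray(pos1, dtype=float) - vertex
    d2 = np.asarray(pos2, dtype=float) - vertex
    cos_theta = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    # Guard against rounding slightly outside [-1, 1].
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return float(np.sqrt(2.0 * e1 * e2 * (1.0 - cos_theta)))
```

Because the opening angle enters through the cluster positions, the resolution of the reconstructed mass depends directly 
on how well the clustering locates the cluster centers in the high resolution layers. 
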
Figure 10: Hits on the 4th, 8th and 12th layers of the calorimeter, obtained from two closely placed neutral pions, each of 50 GeV energy. The left 
panels, (a), (c), and (e), show the results of the dFCM clustering routine, shown by the large solid points. The right panels, (b), (d) and (f), show the 
clusters which survive after a cut on the number of cell hits in a cluster. 
The dynamic version of the FCM algorithm (dFCM) allows clusters to be generated and eliminated as required 
as data streams in. The dFCM algorithm successfully identified the individual clusters in a non-uniformly distributed 
set of 8 photon clusters, whereas FCM was shown to consider a group of clusters to be one large cluster. The dFCM 
technique was also applied to a simulation of two 50 GeV neutral pions decaying to two photons each. Each of the 
four photons was easily identified, despite the presence of small clusters of outliers. The energies and the number 
of points in each of the outlier clusters are usually small enough for a suitable elimination criterion to be defined. 
However, as different energies as well as different layers give rise to varying densities and different cluster structures, 
a meticulous study needs to be performed in order to come up with proper discrimination criteria. The discrimination 
criteria will have to be tuned in a realistic environment with different colliding systems and colliding energies. 